### R fundamentals 03: Elements of programming, R Markdown (Jul 24, 18)


In the chunk below, we set the default chunk options for the rest of the document. Note that this chunk itself is defined with include=FALSE, which means: evaluate it, but show no output of any kind.

library(dplyr)
library(knitr)
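
The default chunk options mentioned above can be set with knitr's `opts_chunk$set()`. The following is a minimal sketch; the specific option values (echo, message, warning, figure size) are assumptions chosen for illustration, not the settings used in the original document.

```r
# Contents of a setup chunk. In the .Rmd source its header would be
# {r setup, include=FALSE}, so it is evaluated but prints nothing.
library(knitr)

# Defaults applied to every chunk that follows (values are illustrative).
opts_chunk$set(
  echo = TRUE,       # show the code of each chunk
  message = FALSE,   # hide package start-up messages
  warning = FALSE,   # hide warnings
  fig.width = 6,
  fig.height = 4
)

# Confirm that a default was registered.
opts_chunk$get("echo")
```

Any individual chunk can still override these defaults in its own header.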

Data science fundamentals 01: Visualize and explore

HW2

Classwork/Homework:

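1. Run ggplot(data = HANES). What do you see? An empty plot: ggplot() only sets up the plot, and R draws nothing until a geom function with aesthetic mappings is added.
2. How many rows are in HANES? How many columns? The raw file has 1527 rows and 23 columns; after na.omit(), 1112 rows and 23 columns remain (see the str() output below).
3. What does the DX_DBTS variable describe? Diabetes status, with three levels: DIAB, DIAB NO_DX, and NO DIAB. Mapping it to the color of the points shows the status of each observation, with the levels listed in the legend.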
4. Make a scatterplot of HDL vs A1C.
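
The first two answers above can be checked directly in R. This is a small sketch that uses a hypothetical stand-in data frame (`toy`, with invented values) instead of the downloaded HANES data, so it runs without a network connection.

```r
library(ggplot2)

# Hypothetical stand-in for HANES (values invented for illustration).
toy <- data.frame(
  A1C = c(5.0, 5.2, 4.8, 6.9, 7.4, 5.6),
  HDL = c(42, 51, 42, 61, 52, 50),
  DX_DBTS = factor(c("NO DIAB", "NO DIAB", "NO DIAB",
                     "DIAB", "DIAB", "DIAB NO_DX"))
)

# Rows and columns of the data frame.
dim(toy)
nrow(toy)
ncol(toy)

# ggplot() alone has no layers, so it renders an empty panel.
p_empty <- ggplot(data = toy)
length(p_empty$layers)

# Adding a geom with aesthetic mappings gives it something to draw.
p_points <- p_empty + geom_point(mapping = aes(x = A1C, y = HDL))
length(p_points$layers)
```

The same calls (`dim(HANES)`, `nrow(HANES)`, `ncol(HANES)`) apply to the real data set once it is loaded.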

# Load the package RCurl
library(RCurl)
Console says: Loading required package: bitops
Console says: 
Console says: Attaching package: 'RCurl'
Console says: The following object is masked from 'package:tidyr':
Console says: 
Console says:     complete
# Import the HANES data set from GitHub; break the string into two for readability
# (Please note this readability aspect very carefully)
URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
# Paste it to constitute a single URL 
URL <- paste(URL_text_1,URL_text_2, sep="")
HANES <- read.csv(text=getURL(URL))
# Rename the GENDER factor for identification
HANES$GENDER <- factor(HANES$GENDER, labels=c("M","F"))
# Rename the AGEGROUP factor for identification
HANES$AGEGROUP <- factor(HANES$AGEGROUP, labels=c("20-39","40-59","60+"))
# Rename the HSQ_1 factor for identification
HANES$HSQ_1 <- factor(HANES$HSQ_1, labels=c("Excellent","Very Good","Good", "Fair", "Poor"))
# Rename the DX_DBTS as a factor
HANES$DX_DBTS <- factor(HANES$DX_DBTS, labels=c("DIAB","DIAB NO_DX","NO DIAB"))
# Omit all NA from the data frame
HANES <- na.omit(HANES)
# Observe the structure
str(HANES)
Console says: 'data.frame': 1112 obs. of  23 variables:
Console says:  $ KEY              : Factor w/ 1527 levels "133370A","133370B",..: 28 43 44 53 55 70 84 90 100 107 ...
Console says:  $ GENDER           : Factor w/ 2 levels "M","F": 1 1 1 1 1 1 1 1 1 1 ...
Console says:  $ SPAGE            : int  29 28 27 24 30 26 31 32 34 32 ...
Console says:  $ AGEGROUP         : Factor w/ 3 levels "20-39","40-59",..: 1 1 1 1 1 1 1 1 1 1 ...
Console says:  $ HSQ_1            : Factor w/ 5 levels "Excellent","Very Good",..: 2 2 2 1 1 3 1 2 1 3 ...
Console says:  $ UCREATININE      : int  105 53 314 105 163 150 46 36 177 156 ...
Console says:  $ UALBUMIN         : num  0.707 1 8 4 3 2 2 0.707 4 3 ...
Console says:  $ UACR             : num  0.00673 2 3 4 2 ...
Console says:  $ MERCURYU         : num  0.37 0.106 0.487 2.205 0.979 ...
Console says:  $ DX_DBTS          : Factor w/ 3 levels "DIAB","DIAB NO_DX",..: 3 3 3 3 3 3 3 3 3 3 ...
Console says:  $ A1C              : num  5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 5.2 ...
Console says:  $ CADMIUM          : num  0.2412 0.1732 0.0644 0.0929 0.1202 ...
Console says:  $ LEAD             : num  1.454 1.019 0.863 1.243 0.612 ...
Console says:  $ MERCURYTOTALBLOOD: num  2.34 2.57 1.32 14.66 2.13 ...
Console says:  $ HDL              : int  42 51 42 61 52 50 57 56 42 44 ...
Console says:  $ CHOLESTEROLTOTAL : int  184 157 145 206 120 155 156 235 156 120 ...
Console says:  $ GLUCOSESI        : num  4.61 4.77 5.16 5 5.11 ...
Console says:  $ CREATININESI     : num  74.3 73 80 84.9 66 ...
Console says:  $ CREATININE       : num  0.84 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 1.09 ...
Console says:  $ TRIGLYCERIDE     : int  156 43 108 65 51 29 31 220 82 35 ...
Console says:  $ GLUCOSE          : int  83 86 93 90 92 85 72 87 96 92 ...
Console says:  $ COTININE         : num  31.5918 0.0635 0.035 0.0514 0.035 ...
Console says:  $ LDLESTIMATE      : int  111 97 81 132 58 99 93 135 98 69 ...
Console says:  - attr(*, "na.action")= 'omit' Named int  2 15 16 24 26 28 33 34 35 39 ...
Console says:   ..- attr(*, "names")= chr  "2" "15" "16" "24" ...
# Load the tidyverse library
library(tidyverse)
  # Make a ggplot
ggplot(data = HANES) + 
geom_point(mapping = aes(x = log(A1C), y = log(UACR)))

ggplot(data = HANES)

# Make a ggplot with the color aesthetic mapped to the variable DX_DBTS
ggplot(data = HANES) + 
geom_point(mapping = aes(x = log(A1C), y = log(UACR), color=DX_DBTS))

## 4. Make a scatterplot of HDL vs A1C.
# color = "blue" goes outside aes() so that the points are actually drawn in blue
ggplot(data = HANES) + 
geom_point(mapping = aes(x = HDL, y = A1C), color = "blue")


Classwork/Homework:

What’s gone wrong with this code? Why are the points not blue?

   # Make a ggplot with color = "blue" placed inside aes()
  ggplot(data = HANES) + 
  geom_point(mapping = aes(x = log(A1C), y = log(UACR), color="blue"))

Which variables in HANES are categorical? Which variables are continuous? How can we see this information? Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

Work:


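## Incorrect: with color = "blue" inside aes(), ggplot2 treats "blue" as a one-level variable,
## so the points get a default palette color and a legend.
## Put color inside aes() only when mapping a variable (for example a categorical one).
ggplot(data = HANES) + 
geom_point(mapping = aes(x = log(A1C), y = log(UACR), color="blue"))

## Correct: color = "blue" outside aes() sets the color of every point manually
ggplot(data = HANES) + 
geom_point(mapping = aes(x = log(A1C), y = log(UACR)), color="blue")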
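
The difference between the two plots above can also be inspected without looking at the picture. This sketch assumes a hypothetical data frame `toy` with invented values and uses ggplot2's `layer_data()` to read back the colors that each layer actually computed.

```r
library(ggplot2)

# Hypothetical stand-in for HANES (values invented for illustration).
toy <- data.frame(
  A1C = c(5.0, 5.2, 4.8, 6.9, 7.4, 5.6),
  UACR = c(2, 3, 4, 12, 30, 5)
)

# Mapped: "blue" inside aes() becomes a constant one-level variable.
p_mapped <- ggplot(data = toy) +
  geom_point(mapping = aes(x = log(A1C), y = log(UACR), color = "blue"))

# Set: color outside aes() is applied directly to every point.
p_set <- ggplot(data = toy) +
  geom_point(mapping = aes(x = log(A1C), y = log(UACR)), color = "blue")

# layer_data() returns the computed aesthetics of a layer.
mapped_colours <- unique(layer_data(p_mapped)$colour)
set_colours <- unique(layer_data(p_set)$colour)

mapped_colours  # a default palette colour, not "blue"
set_colours     # "blue"
```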

### 4. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
# Color mapped to a continuous variable (drawn as a gradient)
ggplot(data = HANES) + 
geom_point(mapping = aes(x = log(A1C), y = log(UACR), color=GLUCOSE))

#ggplot(data = HANES) + 
#geom_point(mapping = aes(x = log(A1C), y = log(UACR), color=HSQ_1))
#
#ggplot(data = HANES) + 
#geom_point(mapping = aes(x = log(A1C), y = log(UACR), color=AGEGROUP))
#
## Size mapped to a continuous variable
ggplot(data = HANES) + 
geom_point(mapping = aes(x = log(A1C), y = log(UACR), size=GLUCOSE))

## Color gradient on the border of a filled diamond (shape 23) with a fixed blue fill
ggplot(data = HANES) + 
geom_point(mapping = aes(x = log(A1C), y = log(UACR), color=GLUCOSE), shape=23, fill="blue")

## Same diamond without a fill: the gradient appears on the borders only
ggplot(data = HANES) + 
geom_point(mapping = aes(x = log(A1C), y = log(UACR), color=GLUCOSE), shape=23)

## Color gradient on a solid square (shape 15); fill is dropped because it only affects shapes 21-25
ggplot(data = HANES) + 
geom_point(mapping = aes(x = log(A1C), y = log(UACR), color=GLUCOSE), shape=15)

## Stroke aesthetic: modifies the width of the border (shape 21 has a separate border; shape 15 does not)
ggplot(data = HANES) + 
geom_point(mapping = aes(x = log(A1C), y = log(UACR), color=GLUCOSE), shape=21, stroke=5)

## Map the color aesthetic to a logical (Boolean) expression
ggplot(data = HANES) + 
geom_point(mapping = aes(x = log(A1C), y = log(UACR), color = A1C < 5))

#
#ggplot(data = HANES) + 
#geom_point(mapping = aes(x = log(A1C), y = log(UACR), size=AGEGROUP))


#ggplot(data = HANES) + 
#geom_point(mapping = aes(x = log(A1C), y = log(UACR), shape=HSQ_1))
#
#ggplot(data = HANES) + 
#geom_point(mapping = aes(x = log(A1C), y = log(UACR), shape=AGEGROUP))


Classwork/Homework Answers:

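1. What’s gone wrong with this code? Why are the points not blue? color = "blue" sits inside aes(), so ggplot2 treats it as a variable with the single value "blue", assigns it a color from the default palette, and adds a legend. To set the color manually, place color = "blue" outside aes(), as an argument to geom_point().
2. Which variables in HANES are categorical? Which variables are continuous? The factors are categorical: KEY, GENDER, AGEGROUP, HSQ_1, and DX_DBTS. The remaining int and num columns (for example A1C, GLUCOSE, and UACR) are continuous.
3. How can we see this information? Run str(HANES), or click on HANES in the Environment tab (top right) to open a table view.
4. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables? For color, a continuous variable gets a gradient, while a categorical variable gets a distinct hue per level. Size scales with a continuous variable; mapping it to a categorical variable works but triggers a warning. Shape cannot be mapped to a continuous variable: ggplot2 raises an error.
5. What happens if we map the same variable to multiple aesthetics? It works. Every aesthetic then encodes the same information (for example GLUCOSE as both color and size), which is redundant but can make the pattern easier to see.
6. What does the stroke aesthetic do? What shapes does it work with? stroke modifies the width of the border. It works with shapes that have a border, such as the filled shapes 21-25.
7. What happens if we map an aesthetic to something other than a variable name, like aes(colour = A1C < 5)? ggplot2 evaluates the logical expression for each row and maps the result to color, so the points are colored by TRUE or FALSE for A1C < 5.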
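
Several of the answers above can be verified programmatically. The sketch below assumes a hypothetical data frame `toy` with invented values; it lists the categorical and continuous columns, shows what a logical expression evaluates to, and catches the error raised when shape is mapped to a continuous variable.

```r
library(ggplot2)

# Hypothetical stand-in for HANES (values invented for illustration).
toy <- data.frame(
  A1C = c(5.0, 5.2, 4.8, 6.9, 7.4, 5.6),
  UACR = c(2, 3, 4, 12, 30, 5),
  GLUCOSE = c(83L, 86L, 93L, 140L, 180L, 101L),
  DX_DBTS = factor(c("NO DIAB", "NO DIAB", "NO DIAB",
                     "DIAB", "DIAB", "DIAB NO_DX"))
)

# Factors are categorical; numeric (int or num) columns are continuous.
categorical <- names(toy)[sapply(toy, is.factor)]
continuous <- names(toy)[sapply(toy, is.numeric)]
categorical
continuous

# A logical expression inside aes() is evaluated row by row.
below_five <- toy$A1C < 5
below_five

# Mapping shape to a continuous variable fails when the plot is built.
p_shape <- ggplot(data = toy) +
  geom_point(mapping = aes(x = log(A1C), y = log(UACR), shape = GLUCOSE))
shape_result <- tryCatch(
  {
    ggplot_build(p_shape)
    "built"
  },
  error = function(e) "error"
)
shape_result
```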

Lecture_05_Presentation Slide 43 Work for this HW:

# Make a ggplot with facet grid - AGEGROUP vs DX_DBTS
ggplot(data = HANES) + 
geom_point(mapping = aes(x = log(A1C), y = log(UACR))) + 
facet_grid(AGEGROUP ~ DX_DBTS)

What happens if you facet on a continuous variable? What plots does the following code make? What does . do?

Classwork/Homework Answers:

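1. What happens if you facet on a continuous variable? ggplot2 treats every distinct value as its own level and creates a row or column for each one. With variables such as CREATININE and TRIGLYCERIDE this produces so many panels that the chunk appears to hang.

# Make a ggplot with facet grid - CREATININE ~ TRIGLYCERIDE
# (one panel per combination of distinct values, so this is very slow to render)
ggplot(data = HANES) + 
geom_point(mapping = aes(x = log(A1C), y = log(UACR))) + 
facet_grid(CREATININE ~ TRIGLYCERIDE)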
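
Before faceting on a variable, it can help to estimate how many panels the grid would contain. The sketch below uses a hypothetical six-row data frame `toy`, filled with the first six CREATININE and TRIGLYCERIDE values from the str() output; the counting approach itself is an illustration, not part of the original homework.

```r
# Hypothetical six-row sample (first values shown in the str() output).
toy <- data.frame(
  CREATININE = c(0.84, 0.83, 0.91, 0.96, 0.75, 0.99),
  TRIGLYCERIDE = c(156, 43, 108, 65, 51, 29)
)

# facet_grid(rows ~ columns) creates one row per distinct value of the
# row variable and one column per distinct value of the column variable.
n_rows <- length(unique(toy$CREATININE))
n_cols <- length(unique(toy$TRIGLYCERIDE))
n_panels <- n_rows * n_cols
n_panels
```

Even six observations already give a 6 x 6 grid; on the full data set the same count, computed on HANES, shows why the chunk seems to stop.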


2. What plots does the following code make? What does . do?

ggplot(data = HANES) + 
geom_point(mapping = aes(x = log(HDL), y = log(CHOLESTEROLTOTAL))) +
facet_grid(AGEGROUP ~ .)

ggplot(data = HANES) + 
geom_point(mapping = aes(x = log(UALBUMIN), y = log(GLUCOSE))) +
facet_grid(. ~ DX_DBTS)

Think of the formula as facet_grid(rows ~ columns). facet_grid(AGEGROUP ~ .) stacks one panel per AGEGROUP level in rows, with the strip labels on the right by default. facet_grid(. ~ DX_DBTS) places one panel per DX_DBTS level in columns, with the labels along the top. The . is an empty placeholder: it means no faceting in that dimension.
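
The rows ~ columns reading can be confirmed by inspecting the panel layout that ggplot2 builds. This is a sketch on a hypothetical data frame `toy` with invented values; it relies on the `layout` table of a built plot, which records a ROW and COL for each panel.

```r
library(ggplot2)

# Hypothetical stand-in for HANES (values invented for illustration).
toy <- data.frame(
  HDL = c(42, 51, 42, 61, 52, 50),
  CHOLESTEROLTOTAL = c(184, 157, 145, 206, 120, 155),
  AGEGROUP = factor(c("20-39", "20-39", "40-59", "40-59", "60+", "60+")),
  DX_DBTS = factor(c("NO DIAB", "DIAB", "NO DIAB", "DIAB", "NO DIAB", "DIAB"))
)

base_plot <- ggplot(data = toy) +
  geom_point(mapping = aes(x = log(HDL), y = log(CHOLESTEROLTOTAL)))

# Panel layout for row-only and column-only faceting.
rows_only <- ggplot_build(base_plot + facet_grid(AGEGROUP ~ .))$layout$layout
cols_only <- ggplot_build(base_plot + facet_grid(. ~ DX_DBTS))$layout$layout

# AGEGROUP ~ . gives three rows and one column.
c(rows = max(rows_only$ROW), cols = max(rows_only$COL))
# . ~ DX_DBTS gives one row and two columns.
c(rows = max(cols_only$ROW), cols = max(cols_only$COL))
```
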
3. Derive directed insights from the above plots.


4. Take the first faceted plot in this section:

What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

Colors can obscure data points when groups overlap, which makes them hard to compare. facet_grid() separates the groups into panels, so each level of a categorical variable can be inspected on its own. The disadvantage is that values in different panels are harder to compare than values on a single plot. The color aesthetic is useful when all categories should be seen simultaneously on one graph, and for visualizing the levels of continuous variables. For larger datasets with few categorical variables, facet_grid() becomes more attractive, because it compares the combinations of the categories without points piling up on top of each other.

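5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?
nrow and ncol set the number of rows and columns of panels. Other options that control the layout include scales, dir, as.table, and strip.position. facet_wrap() is generally a better use of screen space than facet_grid() because most displays are roughly rectangular. facet_grid() is most useful when you have two discrete variables and all combinations of the variables exist in the data. It has no nrow and ncol arguments because its numbers of rows and columns are determined by the number of levels of the faceting variables.
6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?
The names of the levels are easier to read horizontally across the column strips, and the values of the continuous variable are easier to compare vertically along the rows. Most displays are also wider than they are tall, so columns have more room.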
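
The effect of nrow in facet_wrap() can be checked on a small example. The sketch assumes a hypothetical data frame `toy` with one invented observation per HSQ_1 level; with five panels and nrow = 2, ggplot2 needs three columns.

```r
library(ggplot2)

# Hypothetical stand-in for HANES (values invented for illustration).
health_levels <- c("Excellent", "Very Good", "Good", "Fair", "Poor")
toy <- data.frame(
  A1C = c(5.0, 5.2, 4.8, 5.1, 4.3),
  UACR = c(2, 3, 4, 2, 5),
  HSQ_1 = factor(health_levels, levels = health_levels)
)

# Wrap the five panels into two rows.
p_wrap <- ggplot(data = toy) +
  geom_point(mapping = aes(x = log(A1C), y = log(UACR))) +
  facet_wrap(~ HSQ_1, nrow = 2)

# The built layout records the row and column of every panel.
wrap_layout <- ggplot_build(p_wrap)$layout$layout
c(panels = nrow(wrap_layout),
  rows = max(wrap_layout$ROW),
  cols = max(wrap_layout$COL))
```

With facet_grid() the same table is fixed by the levels of the faceting variables, which is why it offers no nrow or ncol.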